josnakhatun josnakhatun's profile

Even if the timeline is further relaxed, outages

Even taking a longer view, outages among major Internet companies have been quite frequent in the first half of this year.
On March 29, QQ and WeChat failed one after another, and many functions such as payment, voice calls, Moments, and QQ Space behaved abnormally. Tencent classified this as a first-level incident, and several senior executives, including Lu Shan, President of the Technology Engineering Business Group, and Zhou Hao, Vice President of the WeChat Business Group, were criticized internally. Coincidentally, Vipshop also suffered a P0-level outage on the same day, and the head of its basic platform department was dismissed as a result.
Secondly, the fact that frequent outages coincide with prolonged cost reduction, efficiency drives, and large-scale layoffs inevitably makes people suspect that there is some subtle connection between the two.

To judge whether this suspicion is reasonable, we must first understand what causes cloud server outages.

Generally speaking, the causes of server downtime fall into two categories. One is human-caused failure, such as system faults, design flaws, and short-term overloading. The other is non-human factors, such as extreme weather and temporary power outages in the area. The outages at Vipshop, Tencent, and Alibaba Cloud are typical human-caused failures. The direct cause of all three incidents was an abnormality in the computer room's cooling system, which made the equipment temperature rise rapidly.

This is also the most common cause of failure in the industry.

Generally speaking, neither human nor non-human factors can be completely avoided, so disaster recovery and backup plans are necessary. Alibaba Cloud also admitted in its subsequent response that its failure to handle the incident on site in a timely manner, which led to the sprinkler system being triggered, and its failure to release fault information promptly were important reasons the impact of the outage was amplified.
It was this response that led some people in the industry to point to deeper problems: downsizing, laying off highly paid senior programmers, and relying too heavily on younger staff, along with the absence of emergency and preventive measures such as dual-machine hot backup, backup computer rooms, and multi-node clusters. These are among the reasons the impact of the downtime was exacerbated, and they are also aftereffects of cost reduction and efficiency improvement.